AN ALTERNATIVE TO CORRESPONDENCE ANALYSIS USING HELLINGER DISTANCE

C.  Radhakrishna  Rao 


Abstract. In this paper, a general theory of canonical coordinates is developed for reduction of
dimensionality in multivariate data, assessing the loss of information and plotting higher dimensional
data in two or three dimensions for visual displays. The theory is applied to data in two way tables with
variables in one category and samples (individuals or populations) in the other. The method is applicable
to data with continuous measurements on the variables as well as to frequencies of attributes. An
alternative to the usual correspondence analysis of contingency tables based on Hellinger rather than the
chisquare distance is suggested. The new method has some attractive features and does not suffer from
some inherent drawbacks resulting from the use of the chisquare distance and variable sample sizes for
the populations in the correspondence analysis. The technique of biplots where the populations and the
variables are represented on the same chart is discussed.


1.  Canonical  Coordinates 

The concept of canonical variates (coordinates) was introduced in an early paper by the author (Rao
(1948)) for graphical representation of taxonomical units characterized by multiple measurements. This
was, perhaps, the first attempt to reduce high dimensional data to two or three dimensions using an
objective criterion for purposes of graphical displays. Since then, graphical representation of multivariate
data for visual examination of clusters, outliers and other structures in the data has been an active field
of research. Some of the developments are biplots (Gabriel (1971), Gifi (1990), Nishisato (1980), Gower
(1993), Greenacre (1993)), multidimensional scaling (Kruskal and Wish (1978)), correspondence analysis
(Benzecri (1992), Greenacre (1984)), Chernoff's faces (Chernoff (1973)) and parallel coordinates
(Mahalanobis, Majumdar and Rao (1949), Wegman (1990)). Cavalli-Sforza (1991) uses canonical coordinates
(variables) in interpreting the evolution of human populations.

The object of the present paper is to briefly review the concept of canonical coordinates as originally
introduced in 1948 and later elaborated in Rao (1964, 1979, 1980, 1985) in the light of modern
developments and present an alternative to the current practice of correspondence analysis, which seems to
have some attractive properties.

In Section 2 we consider the general problem of transforming the points of a p-dimensional vector
space endowed with a specified inner product to a lower dimensional Euclidean space with the usual
definition of inner product and distance. The solution to the problem is considered in a more general
set up than what is possible through the use of the Eckart and Young (1936) theorem. In Section 3, some
measures are introduced to assess the loss of information in reduction of dimensionality. The role of
biplots and their interpretation are also discussed. An alternative to correspondence analysis applied to
contingency tables based on Hellinger rather than the chisquare distance is given in Section 4.

It is argued that the chisquare distance used in correspondence analysis is not an intrinsic measure
of the difference between two given population distributions as it depends to some extent on the whole
set of populations considered in the study, and also on the sample sizes available for the estimation of
population distributions. In such a case, the configuration of a subset of the populations as revealed by
correspondence analysis may depend on what other populations are included in the analysis. An example
is given to show how anomalies can arise in correspondence analysis based on the chisquare distance. On
the other hand no such anomalies arise with the use of Hellinger distance.


1991 Mathematics Subject Classification. 62H30, 62H17.

Key words and phrases. Canonical coordinates, Chisquare distance, Contingency tables, Correspondence analysis,
Hellinger distance, Matrix approximation, Principal component analysis.
2. Reduction of Dimensionality

The problem we consider may be stated as follows. Let $X = (X_1 : \cdots : X_m)$ be a p x m data matrix,
with the i-th column vector $X_i$ representing measurements of p variables made on the i-th population
(individual or unit). The column vector $X_i$ will be referred to as the i-th population profile (PP). The
PP's can be represented as m points in a p-dimensional vector space $R^p$ with a specified inner product
and the associated norm

    $(x, y) = x' M y, \quad x, y \in R^p$    (2.1)

    $\|x\| = (x, x)^{1/2}, \quad x \in R^p$    (2.2)

where M is a positive definite matrix. We may call this the Mahalanobis or M-space. In practical
situations, it may be necessary to attach a weight $w_i > 0$ to the i-th PP, the exact use of which will be
detailed in the following discussion. We represent the vector $(w_1, \ldots, w_m)'$ by w and the diagonal
matrix with $w_i$ as the i-th diagonal element by W. The M-space with weight as an additional dimension
will be referred to as WM-space. [In our treatment we consider W as a general positive definite matrix to
cover more general applications.]

The problem is to find a k x m matrix

    $Y = (Y_1 : \cdots : Y_m)$    (2.3)

with k < p for representing the PP's in a k-dimensional Euclidean space ($E^k$) with the usual inner
product, x'y for $x, y \in E^k$, and the k-vector $Y_i$ as the profile of the i-th population, in such a way
that the relative positions of the PP's in the M-space (in terms of distances between profiles) are preserved
to the extent possible in $E^k$. For this purpose, we need to have a criterion for measuring the loss of
information in reducing the dimension of the profile space, by minimizing which we obtain an optimum
solution for (2.3).

The relative positions of the PP's in the M-space can be described by what may be called a configuration
matrix

    $C = (X - \xi 1')' M (X - \xi 1') = ((X_i - \xi)' M (X_j - \xi)) = (c_{ij})$    (2.4)

where $\xi$ is some chosen reference (profile) vector and the $c_{ij}$'s represent the distances and angles
between profiles.

The corresponding configuration about the origin in the reduced space is Y'Y. The problem then
reduces to minimizing

    $\| C - Y'Y \|$    (2.5)

with respect to Y, a k x m matrix as defined in (2.3), for a suitably chosen matrix norm. The following
theorem proved in Rao (1979, 1980, 1985) provides the solution.

Theorem 1. Consider the s.v.d. (singular value decomposition)

    $M^{1/2} (X - \xi 1') W^{1/2} = \lambda_1 U_1 V_1' + \cdots + \lambda_p U_p V_p'$    (2.6)

with singular values $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$, where $M^{1/2}$ and $W^{1/2}$ are
symmetric square roots of M and W. Then the choice

    $Y = (\lambda_1 W^{-1/2} V_1 : \cdots : \lambda_k W^{-1/2} V_k)'$    (2.7)

or conventionally written in the transposed form

    $\lambda_1 W^{-1/2} V_1, \lambda_2 W^{-1/2} V_2, \ldots, \lambda_k W^{-1/2} V_k$    (2.8)

where the components of the i-th m-vector are the i-th canonical coordinates (i.e., the coordinates in the
i-th dimension of the reduced space) for the different populations, minimizes (2.5) for any (W, W)-invariant
norm as defined in Note 2.1. We call these coordinates the canonical coordinates for populations (CCP).
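
As a rough numerical illustration of Theorem 1, the following Python sketch computes canonical coordinates for a diagonal W. It assumes NumPy is available and takes the weighted mean profile as the reference vector; the function names `sym_sqrt` and `canonical_coordinates` and the data are hypothetical, chosen only for this illustration.

```python
import numpy as np

def sym_sqrt(A):
    """Symmetric square root of a positive definite matrix (spectral decomposition)."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(vals)) @ vecs.T

def canonical_coordinates(X, M, w, k):
    """Canonical coordinates for populations with diagonal weights w.

    X is p x m (one column per population profile), M is p x p positive definite.
    Returns Y (k x m), all singular values, and the reference vector xi.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    xi = X @ w / w.sum()                      # weighted mean profile (assumed reference vector)
    Xc = X - xi[:, None]                      # X - xi 1'
    G = (sym_sqrt(M) @ Xc) * np.sqrt(w)       # M^{1/2} (X - xi 1') W^{1/2}
    U, lam, Vt = np.linalg.svd(G, full_matrices=False)
    Y = lam[:k, None] * Vt[:k] / np.sqrt(w)   # i-th row: lam_i V_i' W^{-1/2}
    return Y, lam, xi

# Hypothetical data: p = 3 variables measured on m = 5 populations.
X = np.array([[1.0, 2.0, 0.5, 3.0, 2.5],
              [0.0, 1.5, 1.0, 2.0, 0.5],
              [2.0, 0.5, 1.5, 1.0, 3.0]])
M = np.diag([1.0, 2.0, 0.5])
w = np.array([0.1, 0.3, 0.2, 0.25, 0.15])

Y_full, lam, xi = canonical_coordinates(X, M, w, k=3)
Xc = X - xi[:, None]
C = Xc.T @ M @ Xc                             # configuration matrix (2.4)
Y2, _, _ = canonical_coordinates(X, M, w, k=2)
```

With k = p no dimension is dropped, so `Y_full.T @ Y_full` should reproduce C up to rounding, while `Y2` gives the two-dimensional coordinates one would plot.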
Note 2.1. An (A, B)-invariant norm of an m x n matrix is the usual norm (satisfying the postulates of
a norm) with the additional property

    $\| C \cdot D \| = \| \cdot \|$ for any C, D such that $C'AC = A$, $D'BD = B$    (2.9)

where C is an m x m matrix, D is an n x n matrix, and A and B are positive definite matrices of orders
m and n respectively. This is a generalization given in Rao (1980) of a unitarily invariant norm defined
by von Neumann (1937) with A and B as unit matrices.

Note 2.2. In our applications, we indicate some choices of the reference vector $\xi$. However, we note
that a further minimization of (2.5) with respect to $\xi$ leads to the choice

    $\xi = (1'W1)^{-1} X W 1$    (2.10)

where 1 is the column vector of unities.

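Note 2.3. Using the notation

    $\Lambda_{(i)} = \mathrm{Diag}(\lambda_1, \ldots, \lambda_i)$

    $U_{(i)} = (U_1 : \cdots : U_i), \quad V_{(i)} = (V_1 : \cdots : V_i)$

we may write the solution Y given in (2.7) in the concise form

    $Y = \Lambda_{(k)} V_{(k)}' W^{-1/2}$    (2.11)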

Note 2.4. In the expression (2.6), a symmetric square root of a positive definite matrix is used. It
can be computed in a simple way as follows. If A is a positive definite matrix of order p with the spectral
decomposition

    $A = \sum \lambda_i^2 Q_i Q_i' = Q \Lambda^2 Q'$

where $Q = (Q_1 : \cdots : Q_p)$, then

    $A^{1/2} = \sum \lambda_i Q_i Q_i' = Q \Lambda Q'$

    $A^{-1/2} = \sum (\lambda_i)^{-1} Q_i Q_i' = Q \Lambda^{-1} Q'$    (2.12)

We may look at the problem in a slightly different way by defining what is called the dispersion
matrix between profiles

    $B = (X - \xi 1') W (X - \xi 1')' = (b_{ij})$    (2.13)

where $b_{ii}$ is the weighted variance of the i-th variable and $b_{ij}$ is the weighted covariance between
the i-th and j-th variables across the profiles. Consider an approximation, $Z_i \in R^p$, to $(X_i - \xi)$,
with the restriction that $Z_1, \ldots, Z_m$ lie in a k dimensional subspace of $R^p$, in which case we have
the representation

    $Z = (Z_1 : \cdots : Z_m) = AC$    (2.14)

where A is a p x k matrix whose columns span the subspace and C is a k x m matrix. Without loss of
generality we may choose A to satisfy the condition A'MA = I (i.e., the columns of A are orthonormal in
the M-space). The dispersion matrix between profiles in the reduced space is ACWC'A', and we choose
A and C such that

    $\| B - ACWC'A' \|$    (2.15)

is a minimum for an appropriate norm of the matrix. The solution is given in Theorem 2, which is proved
on the same lines as in Theorem 1.


Theorem 2. Consider the same s.v.d. as in Theorem 1

    $M^{1/2} (X - \xi 1') W^{1/2} = \lambda_1 U_1 V_1' + \cdots + \lambda_p U_p V_p'$.

Then the optimum choice of AC which minimizes (2.15) for any (M, M)-invariant norm is

    $A_{(k)} C_{(k)} = M^{-1/2} (\lambda_1 U_1 V_1' + \cdots + \lambda_k U_k V_k') W^{-1/2}$    (2.16)

where the suffix (k) is introduced to indicate the dimension of the reduced space.

We may choose

    $A_{(k)} = M^{-1/2} U_{(k)}, \quad C_{(k)} = \Lambda_{(k)} V_{(k)}' W^{-1/2}$.    (2.17)


Note 2.5. We may represent the profiles by plotting the columns of $C_{(k)}$ in a k-dimensional Euclidean
space, which is the same solution as that obtained in Theorem 1.

A geometric approach to the problem of reduction of dimensionality is to fit a k-dimensional plane
to the data. A set of m points on a k-plane can be written as

    $\xi 1' + AC$    (2.18)

where A is a p x k matrix and C is a k x m matrix. We determine A, C, $\xi$ such that

    $\| X - \xi 1' - AC \|$    (2.19)

is a minimum for a suitably chosen norm. The solution is given in Theorem 3, which is proved on the
same lines as in Theorems 1 and 2.


Theorem 3. Consider the same s.v.d. as in Theorem 1. Then the choices of A and C as in Theorem
2 and $\xi = (1'W1)^{-1} X W 1$ as in (2.10) minimize any (M, W)-invariant norm of (2.19).

Note 2.6. We may also look at the problem in some other ways. Let T be a k x p matrix providing a
transformation of the column vectors of X to Y = TX in a k-dimensional space with the induced inner
product matrix $(T M^{-1} T')^{-1}$. The squared distance between the i-th and j-th profiles is

    $D_{ij}^2 = (X_i - X_j)' M (X_i - X_j)$    (2.20)

in the full space, and

    $D_{ij(k)}^2 = (X_i - X_j)' T' (T M^{-1} T')^{-1} T (X_i - X_j)$    (2.21)

in the reduced space. By definition $D_{ij(k)}^2 \leq D_{ij}^2$. We may then choose T by minimizing some
function of the differences or ratios of $D_{ij}^2$ and $D_{ij(k)}^2$.

One of the functions suggested in Rao (1948) was the weighted sum of all possible differences

    $\sum \sum w_i w_j (D_{ij}^2 - D_{ij(k)}^2)$    (2.22)

which leads to the same solution for Y = TX as in Theorems 1, 2 and 3.

Another method is to choose T by maximizing the minimum of $D_{ij(k)}^2$ over all i and j as suggested
by Eslava-Gomez and Marriott (1993), or by maximizing the minimum of the ratios $D_{ij(k)}^2 / D_{ij}^2$.
Both these methods are computationally very complex, but can be managed when p is small.

Note 2.7. The choices of M and W as inputs in the analysis for canonical coordinates need some
discussion. The choice of M is related to the distance measure between profiles appropriate to a given
investigation. In taxonomical classification, M is generally chosen as the inverse of the variance-covariance
(dispersion) matrix of the measurements on units within taxa leading to Mahalanobis (1936) distance (see
Rao (1945, 1947)). The matrix W is taken to be diagonal with the i-th diagonal element $w_i$ proportional
to the number of individuals sampled from the i-th taxon to estimate its profile. For a chosen M, the
configuration of the profiles in the reduced space will depend on W, but is likely to be robust provided
the $w_i$'s are not widely different. In the study reported in Rao (1948), all the $w_i$'s were chosen as
equal although the sample sizes for different populations were different. However, the choice of $w_i$'s as
proportional to sample sizes enables us to test hypotheses on goodness of fit of lower dimensional planes
to the observed profiles. For details, the reader is referred to Rao (1973, pp. 556-560, 1985).

If we desire the configuration of a subset of profiles to be better preserved in the reduced space
than the others, then we have to give bigger weights to those profiles.
Note 2.8. In many situations we have a data matrix giving the measurements of p variables made
on m individuals without any further information to guide us in the choices of the M and W matrices.
In such cases, the usual choices of M and W are the unit matrices and the resulting canonical coordinate
analysis is the Principal Component Analysis (PCA) introduced by Hotelling. Some characterizations of
the principal components and their applications are given in papers by Rao (1958, 1964, 1987). It is also
the practice to apply PCA on CX, i.e., after a suitable scaling of the measurements. One choice of C is a
diagonal matrix with the i-th diagonal element $c_i = s_{ii}^{-1/2}$, where $s_{ii}$ is the i-th diagonal
element of the matrix

    $(X - \bar{X} 1') (X - \bar{X} 1')'$.    (2.23)

This procedure is equivalent to using the canonical coordinate analysis choosing $M = C^2$ and W = I.
Another possibility which has not been considered before is the choice $c_i = 1/m_i$, where $m_i$ is a
measure of location such as the mean or median of the measurements on the i-th variable.

Note 2.9. A more general problem not considered in this paper is as follows. The basic space is
somewhat general with a specified nonnegative proximity index between any two points. Given a set of
points with the matrix of proximity indices between points, the problem is to transform the points to
a low dimensional Euclidean space such that the inequality relationships between proximity indices are
maintained to the extent possible in the corresponding Euclidean distances. Such a transformation is
achieved through the algorithm for multidimensional scaling as developed by Kruskal and Wish (1978).


3.  Loss  of  information 

The  representation  of  the  PP’s  in  a  lower  dimensional  space  will  entail  some  loss  of  information 
depending  on  the  object  of  statistical  analysis.  However,  we  provide  some  general  criteria  for  assessing 
the  amount  of  distortion  in  the  configuration  of  the  profiles  due  to  reduction  of  dimensionality. 

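In Theorems 1 and 2 of Section 2, it is shown that the best approximation to X in the reduced space
is

    $\hat{X} = \xi 1' + M^{-1/2} (\lambda_1 U_1 V_1' + \cdots + \lambda_k U_k V_k') W^{-1/2}$    (3.1)

so that the matrix

    $D_1 = X - \hat{X} = M^{-1/2} (\lambda_{k+1} U_{k+1} V_{k+1}' + \cdots + \lambda_p U_p V_p') W^{-1/2}$    (3.2)

gives a complete account of the errors in individual profiles due to reduction.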

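The configuration of the profiles in the reduced space is

    $C_{(k)} = W^{-1/2} (\lambda_1^2 V_1 V_1' + \cdots + \lambda_k^2 V_k V_k') W^{-1/2}$    (3.3)

so that the matrix

    $D_2 = C_{(p)} - C_{(k)} = W^{-1/2} (\lambda_{k+1}^2 V_{k+1} V_{k+1}' + \cdots + \lambda_p^2 V_p V_p') W^{-1/2}$    (3.4)

measures the distortion in the configuration, where $C_{(p)} = C$ as defined in (2.4). An overall (weighted)
measure of loss of information is the ratio of

    $\mathrm{trace}(D_2 W) = \lambda_{k+1}^2 + \cdots + \lambda_p^2$    (3.5)

to the total variation $(\lambda_1^2 + \cdots + \lambda_p^2)$, which can be written as

    $1 - \sum_{i=1}^{k} \lambda_i^2 \Big/ \sum_{i=1}^{p} \lambda_i^2$.    (3.6)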
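
The overall loss measure is a simple function of the singular values. The following Python sketch evaluates it for hypothetical singular values; the function name `loss_of_information` is an assumption made for illustration only.

```python
def loss_of_information(singular_values, k):
    """Share of total variation lost when only the first k canonical coordinates are kept."""
    squares = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squares)
    return 1.0 - sum(squares[:k]) / total

# Hypothetical singular values lambda_1 >= lambda_2 >= lambda_3.
lams = [3.0, 2.0, 1.0]
loss_k1 = loss_of_information(lams, 1)   # 1 - 9/14
loss_k2 = loss_of_information(lams, 2)   # 1 - 13/14
```

In this made-up case a one-dimensional display would lose 5/14 of the total variation, and a two-dimensional display 1/14.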

It is more important to assess the distortions in the inter profile squared distances. The matrix of
these squared distances denoted by S can be computed from the configuration matrix C using the formula

    $S = c 1' + 1 c' - 2C$    (3.7)

where c is the vector of the diagonal elements of C. The corresponding matrix in the reduced space is

    $S_{(k)} = c_{(k)} 1' + 1 c_{(k)}' - 2 C_{(k)}$    (3.8)

so that the matrix

    $D_3 = S - S_{(k)} = (d_{ij})$    (3.9)

measures the deficiencies in the distances due to reduction of dimensionality. An overall measure of
deficiency is

    $\sum \sum w_i w_j d_{ij} = \lambda_{k+1}^2 + \cdots + \lambda_p^2$    (3.10)

which is the same as in (3.5).
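
The relation (3.7) between the configuration matrix and the squared distances can be checked on a small case. The sketch below is plain Python with M = I; the helper names `configuration` and `squared_distances`, the profiles and the reference vector are all hypothetical.

```python
def squared_distances(C):
    """S = c 1' + 1 c' - 2C, where c holds the diagonal of the configuration matrix C."""
    m = len(C)
    c = [C[i][i] for i in range(m)]
    return [[c[i] + c[j] - 2.0 * C[i][j] for j in range(m)] for i in range(m)]

def configuration(X_cols, xi):
    """Configuration matrix (2.4) with M = I, for profiles given as a list of columns."""
    centred = [[x - r for x, r in zip(col, xi)] for col in X_cols]
    return [[sum(a * b for a, b in zip(u, v)) for v in centred] for u in centred]

# Hypothetical profiles (three populations, two variables) and reference vector.
profiles = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
xi = [1.0, 2.0]
S = squared_distances(configuration(profiles, xi))
```

Here `S[0][1]` is 25, the squared Euclidean distance between (0, 0) and (3, 4). The result does not depend on the reference vector, since the reference cancels in every difference of profiles.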

The dispersion matrix between profiles in the whole space, as introduced in (2.13) is

    $B = (X - \xi 1') W (X - \xi 1')' = (b_{ij})$    (3.11)

while the corresponding matrix in the reduced k-dimensional space is

    $B_{(k)} = M^{-1/2} (\lambda_1^2 U_1 U_1' + \cdots + \lambda_k^2 U_k U_k') M^{-1/2} = (b_{ij(k)})$.    (3.12)

The proportion of the between profile variance in the i-th variable explained by the first k canonical
variates (coordinates) is

    $b_{ii(k)} / b_{ii}, \quad i = 1, \ldots, p$.    (3.13)

For an interpretation of the canonical coordinates in different dimensions it would be useful to
compute the proportion of variance in each variable explained by each of the canonical variates, i.e.,
to obtain a decomposition of (3.13) in terms of canonical variates. For this purpose, we introduce the
matrices

    $E_1 = M^{-1/2} (\lambda_1 U_1 : \cdots : \lambda_p U_p) = (e_{ij})$    (3.14)

    $E_2 = (e_{ij} / \sqrt{b_{ii}}) = (f_{ij})$    (3.15)

where $b_{ii}$ is as defined in (3.11). Let $E_{i(k)}$ be the matrix obtained by retaining only the first k
columns in $E_i$ for i = 1, 2. Then it is seen that

    $E_1 E_1' = B, \quad E_{1(k)} E_{1(k)}' = B_{(k)}$.    (3.16)

Let us consider the matrix $E_{1(k)}$ and define what may be called canonical coordinates for variables
(CCV) in k dimensions as follows.


Table 1. Canonical coordinates for variables

    variable    dim 1    dim 2    ...    dim k
    1           e_11     e_12     ...    e_1k
    2           e_21     e_22     ...    e_2k
    ...         ...      ...      ...    ...
    p           e_p1     e_p2     ...    e_pk

If we plot the variables as points in $E^k$ using the row coordinates in different dimensions, then the
scalar products of the vectors representing the variables are the elements of the best k-dimensional
approximation to B.

There is some advantage in plotting the variables using the standardized coordinates $(f_{ij})$ defined in
(3.15) as shown in Table 2.

Table 2. Standardized CCV's and the variance explained by each canonical variate

    Variable    Standardized coordinates    Proportion of variance explained
                dim 1    ...    dim k       dim 1     ...    dim k     total
    1           f_11     ...    f_1k        f_11^2    ...    f_1k^2    f_11^2 + ... + f_1k^2
    ...         ...      ...    ...         ...       ...    ...       ...
    p           f_p1     ...    f_pk        f_p1^2    ...    f_pk^2    f_p1^2 + ... + f_pk^2

The magnitudes in the right hand block of Table 2 indicate the influence of different variables in
each dimension (canonical variate) in the reduced space. This may enable us to associate each dimension
with certain variables. We may plot the variables using the standardized CCV's in the same chart as the
canonical coordinates for the profiles. It is seen that all variable points lie inside the unit sphere in $E^k$,
and the variables close to the surface of the sphere have greater influence on the canonical variates.

It may also be mentioned that it is the usual practice in a biplot to represent the i-th variable as a
directed line using the direction cosines proportional to the i-th row elements in the matrix

    $E_{3(k)} = M^{-1/2} (U_1 : \cdots : U_k)$    (3.17)

in which case the projections of a profile point in these directions are proportional to the approximate
coordinates of the profile in the original space (see Gabriel (1971) and Greenacre (1993)).
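
The standardized CCV's of (3.15) and the proportions of variance they explain can be sketched numerically as follows. The code assumes NumPy, takes M = I with a diagonal W and the weighted mean profile as reference vector, and uses a hypothetical function name `standardized_ccv` and made-up data.

```python
import numpy as np

def standardized_ccv(X, w, k):
    """Standardized canonical coordinates for variables, taking M = I and diagonal W."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    xi = X @ w / w.sum()                       # weighted mean profile (assumed reference)
    G = (X - xi[:, None]) * np.sqrt(w)         # (X - xi 1') W^{1/2}, since M = I
    U, lam, _ = np.linalg.svd(G, full_matrices=False)
    E1 = U * lam                               # e_ij = lam_j * U_ij, as in (3.14)
    b = np.sum(E1 ** 2, axis=1)                # b_ii, the diagonal of B = E1 E1'
    F = E1 / np.sqrt(b)[:, None]               # f_ij = e_ij / sqrt(b_ii), as in (3.15)
    return F[:, :k], (F ** 2)[:, :k]

# Hypothetical data: p = 3 variables, m = 5 populations.
X = np.array([[1.0, 2.0, 0.5, 3.0, 2.5],
              [0.0, 1.5, 1.0, 2.0, 0.5],
              [2.0, 0.5, 1.5, 1.0, 3.0]])
w = np.array([0.1, 0.3, 0.2, 0.25, 0.15])

F_all, prop_all = standardized_ccv(X, w, k=3)
F2, prop2 = standardized_ccv(X, w, k=2)
```

Each row of `prop2` gives the share of a variable's between profile variance explained by the first and second canonical variates. The row sums cannot exceed one, which is why the variable points fall inside the unit sphere.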

Note 3.1. We may consider the k columns in Table 1 of the CCV's as k points in the p-dimensional
variable space. These points were termed typical profiles in Rao (1964), in the sense that the
variance-covariance matrix of the variables computed from them provides the best approximation to that
computed from all the original profiles.

Note 3.2. The standardized CCV's are not the coordinates for row profiles. They are used for
interpreting the CC's of column profiles. If a representation of row profiles is needed, we consider the
matrix X' with appropriate choices of the M and W matrices (which may be different from those used
for column profiles) and repeat the analysis indicated in (2.6)-(2.8).


4.  Application  to  two  way  contingency  tables 

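We consider dichotomous categorical data with s rows and m columns and $n_{ij}$ observations in the
(i, j)-th cell. Define

    $N = (n_{ij}), \quad n_{i\cdot} = \sum_{j=1}^{m} n_{ij}, \quad n_{\cdot j} = \sum_{i=1}^{s} n_{ij}, \quad n = \sum_{i=1}^{s} \sum_{j=1}^{m} n_{ij}$

    $R = \mathrm{Diag}(n_{1\cdot}/n, \ldots, n_{s\cdot}/n), \quad C = \mathrm{Diag}(n_{\cdot 1}/n, \ldots, n_{\cdot m}/n)$

    $P = n^{-1} N C^{-1} = \begin{pmatrix} p_{1|1} & \cdots & p_{1|m} \\ \vdots & & \vdots \\ p_{s|1} & \cdots & p_{s|m} \end{pmatrix}$    (column profiles)    (4.1)

    $Q = n^{-1} R^{-1} N = \begin{pmatrix} q_{1|1} & \cdots & q_{m|1} \\ \vdots & & \vdots \\ q_{1|s} & \cdots & q_{m|s} \end{pmatrix}$    (row profiles)

    $p = (p_1, \ldots, p_s)' = R1, \quad q = C1 = (q_1, \ldots, q_m)'$.    (4.2)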


The problem is to represent the column (row) profiles as points in $E^k$, k < s, such that the Euclidean
distances between points reflect specified affinities between the corresponding column (row) profiles.

The technique developed for this purpose by Benzecri (1992) is known as correspondence analysis
(CA) which can be identified as canonical coordinate analysis. For instance, for representing the column
profiles by this method, one chooses

    $X = P, \quad M = R^{-1}, \quad W = C$    (4.3)

and applies the analysis described in Theorem 1 (equation (2.6)). Thus one finds the s.v.d. of

    $R^{-1/2} (P - p 1') C^{1/2} = \lambda_1 U_1 V_1' + \cdots + \lambda_s U_s V_s'$    (4.4)

giving the coordinates for the column profiles in $E^k$

    $\lambda_1 C^{-1/2} V_1, \ldots, \lambda_k C^{-1/2} V_k$    (4.5)
where the components of the i-th vector are the coordinates of the profiles in the i-th dimension. The
standardized canonical coordinates in $E^k$ for the rows, as described in (3.15), obtained from the same
s.v.d. as in (4.4) are

    $\lambda_1 \Delta^{-1} R^{1/2} U_1, \ldots, \lambda_k \Delta^{-1} R^{1/2} U_k$    (4.6)

where the components of the i-th vector are the coordinates of the rows in the i-th dimension and $\Delta$ is
a diagonal matrix with the i-th diagonal element as the square root of the i-th diagonal element of the
matrix

    $\lambda_1^2 R^{1/2} U_1 U_1' R^{1/2} + \cdots + \lambda_s^2 R^{1/2} U_s U_s' R^{1/2} = (P - p 1') C (P - p 1')'$.    (4.7)

The coordinates (4.6) do not represent the row profiles but are useful in interpreting the different
dimensions of the column profiles. The coordinates for representing the row profiles in correspondence
analysis are given in (4.11).

Implicit in this analysis is the choice of measure of affinity between the i-th and j-th profiles as the
squared distance (with $p_1, \ldots, p_s$ as defined in (4.2))

    $d_{ij}^2 = \frac{(p_{1|i} - p_{1|j})^2}{p_1} + \cdots + \frac{(p_{s|i} - p_{s|j})^2}{p_s}$    (4.8)

which is the chisquare distance. The squared Euclidean distance in $E^k$, the reduced space, between the
points representing the i-th and j-th profiles is an approximation to (4.8). Thus the clusters we see in
the Euclidean representation are based on the affinities as measured by the chisquare distance (4.8).

Why  should  one  choose  the  chisquare  distance  to  measure  the  affinities  between  profiles?  Some  of 
the  advantages  mentioned  by  Benzecri  and  Greenacre  are  as  follows. 

1. Note that the expression in (4.4)

    $R^{-1/2} (P - p 1') C^{1/2} = R^{1/2} (Q - 1 q') C^{-1/2} = T$    (4.9)

so that if we need a representation of the row (as population) profiles in $E^k$, we use the same s.v.d. as
in (4.4)

    $R^{1/2} (Q - 1 q') C^{-1/2} = \lambda_1 U_1 V_1' + \cdots + \lambda_s U_s V_s'$    (4.10)

leading to the row (population) coordinates

    $(\lambda_1 R^{-1/2} U_1 : \cdots : \lambda_k R^{-1/2} U_k)$    (4.11)

so that no extra computations are needed if we want a representation of the row profiles also. In
correspondence analysis it is customary to plot the points (4.5) and (4.11) in the same chart. Then
the standardized coordinates for the columns (as variables) are

    $\lambda_1 \Delta_1^{-1} C^{1/2} V_1, \ldots, \lambda_k \Delta_1^{-1} C^{1/2} V_k$    (4.12)

where $\Delta_1$ is the diagonal matrix with the i-th diagonal element as the square root of the i-th diagonal
element of $(Q - 1 q')' R (Q - 1 q')$.


2. It is easy to see that

    $n (\lambda_1^2 + \cdots + \lambda_s^2) = n \, \mathrm{trace} \, TT' = \sum \sum \frac{(n_{ij} - n p_i q_j)^2}{n p_i q_j}$, with T as in (4.9),

which is the Pearson chisquare statistic for testing independence between the attributes in a contingency
table. Thus the computations involved in CA automatically allow us to test for independence, and also
to test for the dimensionality of the space of profiles using statistics of the type

    $n (\lambda_{i+1}^2 + \cdots + \lambda_s^2), \quad i = 1, 2, \ldots$    (4.13)

as discussed in Rao (1973, pp. 556-560).


3. CA is only an exploratory data analysis to examine the configuration of row and column profiles
in a general way, so that a particular convenient choice of the distance measure can serve the purpose.

On the other hand, there seem to be some drawbacks in using the chisquare distance.
1. The chisquare distance (4.8) is not a function of the i-th and j-th column profiles only. It involves
the marginal profile which is a weighted average of the individual column profiles. The weights depend on
the observed numbers of individuals in the column categories. These numbers may not have any relevance
to the problem under study, especially when the columns represent different populations from each of
which some individuals are chosen and classified according to row categories. In such a case the marginal
profile depends on the actual sample sizes chosen or realized for different populations. The examples
discussed in the sequel show that the derived configurations in the reduced space may be sensitive to the
sample numbers.

2. The marginal profile depends on the set of populations included in CA. The CA's based on a given
set of populations ($S_1$) and an extended set of populations ($S_1$, $S_2$) may provide different
configurations to the subset $S_1$.

3. There is no particular advantage in plotting the row and column profiles in the same chart. Indeed
one could use different distance measures for column and row profiles and study configurations of the
column and row profiles separately.

4. Since the chisquare distance uses the marginal proportions in the denominator, undue emphasis
is given to the categories with low frequencies in measuring affinities between profiles.
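
The first two drawbacks can be seen on a toy table. The sketch below is plain Python with hypothetical counts; the function name `chisquare_distance` is an assumption for illustration. It evaluates the chisquare distance (4.8) between the same two column profiles before and after a third population is added, and after one sample is enlarged tenfold without changing its profile.

```python
def chisquare_distance(table, i, j):
    """Chisquare distance (4.8) between columns i and j of a table of counts (rows = categories)."""
    n = sum(sum(row) for row in table)
    col_tot = [sum(row[c] for row in table) for c in range(len(table[0]))]
    d2 = 0.0
    for row in table:
        p_marg = sum(row) / n                       # marginal proportion of this row category
        diff = row[i] / col_tot[i] - row[j] / col_tot[j]
        d2 += diff * diff / p_marg
    return d2

# Hypothetical counts: 3 row categories, columns are populations.
base = [[10, 30],
        [20, 30],
        [70, 40]]
extended = [r + [x] for r, x in zip(base, [90, 5, 5])]   # one more population added
resampled = [[r[0], 10 * r[1]] for r in base]             # second sample ten times larger

d_base = chisquare_distance(base, 0, 1)
d_ext = chisquare_distance(extended, 0, 1)
d_res = chisquare_distance(resampled, 0, 1)
```

The two column profiles are identical in all three tables, yet the three distances differ, because the marginal proportions in the denominators change.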

An alternative to the chisquare distance which has some advantages is the Hellinger Distance (HD)
between the i-th and j-th column profiles defined by

    $d_{ij}^2 = (\sqrt{p_{1|i}} - \sqrt{p_{1|j}})^2 + \cdots + (\sqrt{p_{s|i}} - \sqrt{p_{s|j}})^2$    (4.14)

which depends only on the i-th and j-th column profiles. In such a case, the Euclidean distance in the
reduced space between the i-th and j-th column profiles is an approximation to (4.14). For the derivation
of canonical coordinates of the column profiles (considered as populations) we choose

    $X = \begin{pmatrix} \sqrt{p_{1|1}} & \cdots & \sqrt{p_{1|m}} \\ \vdots & & \vdots \\ \sqrt{p_{s|1}} & \cdots & \sqrt{p_{s|m}} \end{pmatrix}, \quad M = I, \quad W = C = \mathrm{Diag}(n_{\cdot 1}/n, \ldots, n_{\cdot m}/n)$

and consider the s.v.d.

    $(X - \xi 1') C^{1/2} = \lambda_1 U_1 V_1' + \cdots + \lambda_s U_s V_s'$.    (4.15)

We may choose $\xi = (\xi_1, \ldots, \xi_s)'$ as

    $\xi_i = \sqrt{p_i} = \sqrt{n_{i\cdot}/n}$, or    (4.16)

    $\xi_i = n^{-1} (n_{\cdot 1} \sqrt{p_{i|1}} + \cdots + n_{\cdot m} \sqrt{p_{i|m}})$.    (4.17)

The canonical coordinates in $E^k$ for the column profiles choosing $\xi$ as in (4.16) or (4.17) are

    $\lambda_1 C^{-1/2} V_1, \ldots, \lambda_k C^{-1/2} V_k$    (4.18)

where the components of the i-th vector are the coordinates of the m column (population) profiles in the
i-th dimension. The standardized coordinates in $E^k$ for the variables, i.e., the row categories, obtained
as described in (3.15) from the same s.v.d. as in (4.15) are

    $\lambda_1 \Delta^{-1} U_1, \lambda_2 \Delta^{-1} U_2, \ldots, \lambda_k \Delta^{-1} U_k$    (4.19)

where $\Delta$ is a diagonal matrix with the i-th diagonal element as the square root of the i-th diagonal
element of

    $\lambda_1^2 U_1 U_1' + \cdots + \lambda_s^2 U_s U_s' = (X - \xi 1') C (X - \xi 1')'$.    (4.20)

The s components of $\lambda_i \Delta^{-1} U_i$ in (4.19) are the coordinates of the variables in the i-th dimension.

It can be shown that the statistic

    $4n (\lambda_1^2 + \cdots + \lambda_s^2)$    (4.21)

is distributed asymptotically as chisquare on (s - 1)(m - 1) degrees of freedom to test independence in
the two way contingency table. Further, hypotheses specifying the dimensions of the subspace in which
the profiles can be represented can also be tested in the same way as in (4.13) using the residual singular
values.
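
When $\xi$ is chosen as in (4.16), the sum of the squared singular values in (4.15) equals the sum of the squared entries of the matrix being decomposed, so the statistic (4.21) can be evaluated without the s.v.d. as four times the sum of $(\sqrt{n_{ij}} - \sqrt{n_{i\cdot} n_{\cdot j} / n})^2$ over all cells. The plain Python sketch below does this for the counts of Table 3 and computes the Pearson chisquare alongside; the function name `independence_statistics` is hypothetical, and the shortcut rests on the assumption that (4.16) is the reference vector used.

```python
import math

# Counts of Table 3: rows are the ten disciplines, columns the funding categories a-e.
TABLE3 = [
    [3, 19, 39, 14, 10],
    [1, 2, 13, 1, 12],
    [6, 25, 49, 21, 29],
    [3, 15, 41, 35, 26],
    [10, 22, 47, 9, 26],
    [3, 11, 25, 15, 34],
    [1, 6, 14, 5, 11],
    [0, 12, 34, 17, 23],
    [2, 5, 11, 4, 7],
    [2, 11, 37, 8, 20],
]

def independence_statistics(counts):
    """Return (Hellinger-based statistic (4.21), Pearson chisquare, degrees of freedom)."""
    n = sum(map(sum, counts))
    row_tot = [sum(r) for r in counts]
    col_tot = [sum(c) for c in zip(*counts)]
    hd = pearson = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # expected count under independence
            hd += (math.sqrt(n_ij) - math.sqrt(expected)) ** 2
            pearson += (n_ij - expected) ** 2 / expected
    dof = (len(counts) - 1) * (len(col_tot) - 1)
    return 4.0 * hd, pearson, dof

hd_stat, pearson_stat, dof = independence_statistics(TABLE3)
```

Both statistics are referred to the same chisquare distribution, here on 36 degrees of freedom. The expression is symmetric in rows and columns, so it does not matter which classification is treated as the populations.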




The  advantages  in  using  HD  between  profiles  are  the  following. 

1.  The  measure  depends  only  on  the  profiles  of  the  concerned  pair.  It  is  not  altered  when  an  extended 
set  of  profiles  is  considered. 

2.  The  measure  does  not  depend  on  the  sample  sizes  on  which  the  profiles  are  estimated. 

3. If a representation of the row profiles is also needed we take $X = \mathrm{sqrt}(Q')$, i.e., the elements of X
are the square roots of the elements of Q' where Q is the matrix defined in (4.2) and compute the s.v.d.

$$(X - \eta 1')R^{1/2} = \mu_1 A_1 B_1' + \ldots + \mu_s A_s B_s' \tag{4.22}$$

leading to the canonical coordinates for row profiles

$$\mu_1 R^{-1/2} B_1, \ \mu_2 R^{-1/2} B_2, \ldots$$

The corresponding standardized coordinates for the columns considered as variables are

$$\mu_1 \Delta_2^{-1} A_1, \ \mu_2 \Delta_2^{-1} A_2, \ldots$$

where $\Delta_2$ is the diagonal matrix with i-th diagonal element as the square root of the i-th diagonal element
of

$$\mu_1^2 A_1 A_1' + \ldots + \mu_s^2 A_s A_s'.$$

4. If we choose $\xi$ as in (4.16), then the (i, j)-th element of the matrix in (4.15) is

$$\sqrt{\frac{n_{ij}}{n}} - \frac{\sqrt{n_{i.}\, n_{.j}}}{n}$$

which is symmetric in i and j. Then, the same s.v.d. as in (4.15) could be used for computing the
canonical coordinates

$$\lambda_1 R^{-1/2} U_1, \ \lambda_2 R^{-1/2} U_2, \ldots, \ \lambda_k R^{-1/2} U_k$$

for the row profiles, as in the case of CA.
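Advantages 1 and 2 can be checked numerically with a short sketch. The helper names below are hypothetical, the squared Hellinger distance is assumed to be the sum of squared differences of square-root proportions, and the chisquare distance is assumed to be the usual one weighted by the inverse of the average profile. The counts are those of Geology, Zoology and Physics from Table 3, with Botany as the added population.

```python
import numpy as np

def profiles(counts):
    """Row profiles: each row of counts divided by its total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def average_profile(counts):
    """Pooled profile over all populations in the table."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum(axis=0) / counts.sum()

def hellinger_sq(p, q):
    """Squared Hellinger distance between two profiles (assumed unscaled)."""
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def chisq_sq(p, q, avg):
    """Squared chisquare distance, weighted by the inverse average profile."""
    return float(np.sum((p - q) ** 2 / avg))

# Rows: Geology, Zoology, Physics (funding categories a-e).
base = np.array([[3, 19, 39, 14, 10],
                 [3, 15, 41, 35, 26],
                 [10, 22, 47, 9, 26]])
scaled = base.copy()
scaled[2] *= 10                                    # tenfold sample for Physics
extended = np.vstack([base, [0, 12, 34, 17, 23]])  # Botany added

for name, table in (("base", base), ("scaled", scaled), ("extended", extended)):
    p = profiles(table)
    print(name,
          hellinger_sq(p[0], p[1]),
          chisq_sq(p[0], p[1], average_profile(table)))
```

The Hellinger column stays the same in all three settings, because it involves only the Geology and Zoology profiles, while the chisquare column moves with the average profile.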


Example  4.1. 

We consider the data (from Greenacre (1993)) on 796 scientific researchers classified according to
their scientific discipline (as populations) and funding category (as variables) as shown in Table 3.


Table 3. Scientific disciplines by research funding categories

Scientific discipline          Funding category              Total
                          a      b      c      d      e

Geology         G         3     19     39     14     10       85
Biochemistry    B1        1      2     13      1     12       29
Chemistry       C         6     25     49     21     29      130
Zoology         Z         3     15     41     35     26      120
Physics         P        10     22     47      9     26      114
Engineering     E         3     11     25     15     34       88
Microbiology    M1        1      6     14      5     11       37
Botany          B2        0     12     34     17     23       86
Statistics      S         2      5     11      4      7       29
Mathematics     M2        2     11     37      8     20       78

Total                    31    128    310    129    198      796

The canonical coordinates for the scientific disciplines (considered as populations) in the first three
dimensions and the percentage of variance explained by each are given in Table 4 for the analyses based on
the chisquare distance (correspondence analysis) and the Hellinger distance (alternative). The formula
(4.10) is used for the analysis based on chisquare and the formula (4.15) for that based on Hellinger
distance. For Hellinger distance analysis, the central point is chosen according to the formula (4.17).

Table 4. Canonical coordinates for the scientific disciplines in the first three dimensions

Subjects         Chisquare distance                  Hellinger distance
             dim1       dim2       dim3          dim1       dim2       dim3

G          .076401    .302569   -.087749      -.031140    .167408   -.048245
B1         .179892   -.454996   -.151716      -.129374   -.242174   -.077614
C          .037644    .073353    .042371      -.021144    .040433    .028254
Z         -.327365    .102283    .064515       .138850    .045255    .056894
P          .315552    .026997    .108688      -.165340    .010679    .023844
E         -.117495   -.291712    .107330       .049451   -.129900    .082901
M1         .012766   -.109656   -.041435      -.004913   -.052588   -.008439
B2        -.178695   -.038501   -.129055       .151404   -.036559   -.108025
S          .124638    .014162    .107190      -.006639    .011703    .052571
M2         .106751   -.061316   -.175688      -.050307   -.037572   -.078000

% var.      47.20      36.66      13.11         45.87      34.10      10.57

The plots of the scientific disciplines (subjects) using the canonical coordinates based on the chisquare
and Hellinger distances are given in Figures 1 and 2 respectively. The coordinates in the third dimension
are plotted on a line on the right hand side of the two dimensional plot. This will be of help in visualizing
the plot in three dimensions and in interpreting the distances in the two dimensional plot. Thus, although
B1 and E appear to be close to each other in the two dimensional chart, they are clearly separated in
the third dimension. No additional distances in the third dimension are involved in the case of P, C, S, Z
and E.

It is of interest to note in this example that the configurations of the scientific disciplines in three
dimensions obtained by the two methods are very similar. The percentage of variance explained in each
dimension is nearly the same for both methods.
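The two analyses of Table 3 can be sketched as below. This is an illustration, not the author's program: `ca_rows` is ordinary correspondence analysis (principal coordinates of the rows), which is assumed here to correspond to formula (4.10), and `hd_rows` follows (4.15) with the central point (4.17) and relative sample sizes as weights, written with populations as rows instead of columns. Both function names are hypothetical, and the signs of the axes returned by an s.v.d. are arbitrary.

```python
import numpy as np

# Table 3: rows are disciplines G, B1, C, Z, P, E, M1, B2, S, M2; columns a-e.
table3 = np.array([
    [3, 19, 39, 14, 10], [1, 2, 13, 1, 12], [6, 25, 49, 21, 29],
    [3, 15, 41, 35, 26], [10, 22, 47, 9, 26], [3, 11, 25, 15, 34],
    [1, 6, 14, 5, 11], [0, 12, 34, 17, 23], [2, 5, 11, 4, 7],
    [2, 11, 37, 8, 20]], dtype=float)

def ca_rows(counts, k=3):
    """Ordinary correspondence analysis: row principal coordinates, % variance."""
    P = counts / counts.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)            # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    coords = (U * sv) / np.sqrt(r)[:, None]
    return coords[:, :k], 100 * sv[:k] ** 2 / np.sum(sv ** 2)

def hd_rows(counts, k=3):
    """Hellinger distance analysis with rows as populations, centre as in (4.17)."""
    r = counts.sum(axis=1) / counts.sum()          # relative sample sizes
    X = np.sqrt(counts / counts.sum(axis=1, keepdims=True))
    Z = np.sqrt(r)[:, None] * (X - r @ X)          # centred, weighted square-root profiles
    U, sv, _ = np.linalg.svd(Z, full_matrices=False)
    coords = (U * sv) / np.sqrt(r)[:, None]
    return coords[:, :k], 100 * sv[:k] ** 2 / np.sum(sv ** 2)

ca_coords, ca_pct = ca_rows(table3)
hd_coords, hd_pct = hd_rows(table3)
print(np.round(ca_pct, 2), np.round(hd_pct, 2))
```

The printed percentages can be set beside the last row of Table 4, and the rows of `ca_coords` and `hd_coords` beside its two halves, allowing for a change of sign in any dimension.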

The standardized canonical coordinates for the funding categories (considered as variables) are computed
using the formula (4.12) for the chisquare analysis and the formula (4.19) for the Hellinger distance
analysis. These are obtained from the same s.v.d. used to compute the canonical coordinates for the
scientific disciplines. Table 5 gives the standardized canonical coordinates for the funding categories, a,
b, c, d, e, using the two methods.

Table 5. Standardized canonical coordinates for funding categories
(variables) in the first three dimensions

Funding          Chisquare distance                  Hellinger distance
category     dim1     dim2     dim3     %var      dim1     dim2     dim3     %var

a            .758     .114    -.619     97.1     -.796    -.164    -.573     98.9
b            .535     .728    -.137     83.5     -.438    -.766    -.008     77.9
c            .583     .352     .094     94.6     -.501    -.327     .759     93.4
d           -.426     .331    -.172     99.8      .888    -.358    -.285     99.7
e           -.108    -.909    -.081     99.6      .088     .978    -.159     98.9

The standardized canonical coordinates for the funding categories are plotted in Figure 3 (for
chisquare distance) and in Figure 4 (for Hellinger distance). It may be noted that all the points lie
within the unit circle. It is customary to represent the canonical coordinates for the subjects and variables
in one chart. We are using separate charts in order to explain the salient features of the configuration
of the variables. The following interpretations emerge from the study of Table 5 and Figures 3 and 4.

1. The configurations of the funding categories as shown in Figures 3 and 4 obtained by using
chisquare and Hellinger distances are very similar.

This is not generally the case, although in some examples studied by the author a good deal
of robustness was observed in the choice of the distance measure and relative sample sizes for
the populations under study. However, addition or deletion of some populations may affect the
configuration of the populations when correspondence analysis is used. Example 4.2 discussed
below throws more light on the problem and shows that the analysis based on Hellinger distance
is more robust to relative sample sizes.

2. Almost all the variation in the funding categories a, d and e is captured in the first three canonical
coordinates of the scientific disciplines. A large percentage of variation in b and c is explained by
the first three coordinates.

3. The first dimension is strongly influenced by a, d, the second dimension by b, e and the third
dimension by a, c.

Thus the use of standardized coordinates for variables enables us to interpret the different dimensions
in terms of observed variables. There are other ways of plotting the coordinates of the variables as
mentioned in the paragraphs below Table 2. Such biplots having a different interpretation are discussed
in Gabriel (1971), Gifi (1990), Gower (1993) and Greenacre (1993).
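The "%var" column of Table 5 can be related to the standardized coordinates by a short calculation. By (4.19) and (4.20) the squared standardized coordinates of a variable, summed over all dimensions, equal one, so 100 times the sum of squares over the first three dimensions should give the percentage of that variable's variation captured by them. The sketch below applies this reading, which is an assumption made here, to the Hellinger distance columns of Table 5 as transcribed above.

```python
import numpy as np

# Hellinger distance columns of Table 5: (dim1, dim2, dim3) and reported %var.
hd_table5 = {
    "a": ([-0.796, -0.164, -0.573], 98.9),
    "b": ([-0.438, -0.766, -0.008], 77.9),
    "c": ([-0.501, -0.327, 0.759], 93.4),
    "d": ([0.888, -0.358, -0.285], 99.7),
    "e": ([0.088, 0.978, -0.159], 98.9),
}

recomputed = {}
for name, (coords, reported) in hd_table5.items():
    # Percentage of the variable's variation in the first three dimensions.
    recomputed[name] = 100.0 * float(np.sum(np.square(coords)))
    print(name, round(recomputed[name], 1), reported)
```

The recomputed values agree with the reported column to within the rounding of the three-decimal coordinates, which is also why every point falls inside the unit circle in a two dimensional plot.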

Note 4.2. In computing the canonical coordinates based on Hellinger distance (HD) using the formula
(4.15), we chose the relative sample sizes as the weights to be attached to the populations. We could
have chosen an alternative set of weights if we wanted distances between a specified subset of populations
to be better preserved in the reduced space than the others. In particular, we could have chosen uniform
weights for all populations. In fact such an option could be exercised if the sample sizes of different
populations were widely different. Unfortunately no such options are available in correspondence analysis.
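The option described in Note 4.2 can be sketched by making the weights an argument of the analysis. The function `hd_coordinates` and its `weights` parameter are hypothetical names; populations are taken as rows, the central point is the weighted mean of the square-root profiles as in (4.17), and the counts are the first five disciplines of Table 3, with Zoology inflated tenfold for the comparison.

```python
import numpy as np

def hd_coordinates(counts, weights=None, k=2):
    """Hellinger distance coordinates for the rows (populations) of `counts`.
    weights=None uses relative sample sizes; any positive weights may be given."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    X = np.sqrt(counts / totals[:, None])          # square-root profiles
    w = totals if weights is None else np.asarray(weights, dtype=float)
    c = w / w.sum()                                # normalized weights
    Z = np.sqrt(c)[:, None] * (X - c @ X)          # centred and weighted
    U, sv, _ = np.linalg.svd(Z, full_matrices=False)
    return (U * sv)[:, :k] / np.sqrt(c)[:, None]

def pairwise(Y):
    """Matrix of Euclidean distances between the rows of Y."""
    return np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)

counts = np.array([[3, 19, 39, 14, 10], [1, 2, 13, 1, 12], [6, 25, 49, 21, 29],
                   [3, 15, 41, 35, 26], [10, 22, 47, 9, 26]])
inflated = counts.copy()
inflated[3] *= 10                                  # tenfold sample for Zoology
uniform = np.ones(len(counts))

change_size_weights = np.abs(pairwise(hd_coordinates(counts)) -
                             pairwise(hd_coordinates(inflated))).max()
change_uniform = np.abs(pairwise(hd_coordinates(counts, uniform)) -
                        pairwise(hd_coordinates(inflated, uniform))).max()
print(change_size_weights, change_uniform)
```

With uniform weights the input to the s.v.d. depends only on the profiles, so the two dimensional configuration is unaffected by the change in sample size; with sample size weights the reduced configuration shifts.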


Example  4.2. 

In Example 4.1, there was a perfect match between the plots based on CA and HD. This probably
demonstrates that the method of derivation of canonical coordinates is somewhat robust to the choice of
the distance measure as well as to the weights. However the choice of HD provides an insurance against
possible distortion due to variations in sample sizes for the populations as the following example shows.

Table 6, reproduced from Gifi (1990), gives the distributions of the pages devoted to different topics
denoted by A, B, C, D, E, F and G in 20 books on multivariate analysis designated as a, b, ..., t. Gifi
(1990) did correspondence analysis on the data and drew some conclusions based on the first three
canonical coordinates which explain a high percentage of variation. The first three canonical coordinates
for the profiles of the books based on CA and HD approaches are given in Table 7.

It may be noted that the total number of pages of a book depends on the font size of the print, while
its profile in terms of proportions of pages used on different topics remains the same for all sizes. Table 8
gives the data on books having the same profiles as in Table 6 with the total number of pages altered for
the books d, f, g, h, j and n.

The three dimensional canonical coordinates based on CA and HD approaches are given in Table 9.
Using the coordinates one can obtain the mutual distances between the books in the three dimensional
reduced Euclidean space. Figure 5 gives a plot comparing the squared distances between books based
on CA using the data of Tables 6 and 8. Figure 6 gives the corresponding plot for the squared distances
based on the HD approach. It is seen that the three dimensional representations of the data of Tables 6
and 8 are more similar under HD analysis than under CA. The relative positions of the books are
influenced by the font size in printing when CA is used, although the profiles of the books are not altered.
There appears to be greater stability with the HD analysis which provides insurance against different
choices of sample sizes. Further, one can exercise the option of using a common weight for all the books
in the HD analysis when the differences in book sizes are large.

Table 6. Number of pages by topics

Books      A      B      C      D      E      F      G

a         31      0      0      0      0    164     11
b          0     16     54     18     27     13     14
c          0     40     32     10     42     60      0
d         19      0     35     19     28    163     52
e         14      7     35     22     17      0     56
f         20     69     72     33     55      0     32
g         74      0     86     14      0     84     48
h         78      0     80      5     17    105     60
i         74     19     33     12     26      0      0
j         80     68     67     15     29      0      0
k        108     48      1     10     46    108      0
l        109     13      5     17     39     32     46
m         16     35     69     24      0     26     41
n         26     86     60      6     48     48     28
o        290     10      6      0      8      0      2
p        184     48     82     42    134      0      0
q         29      0      0      0     41    211     32
r          0     19     56      0     39     75      0
s          0     22     45     42     60    230     59
t         30    128     90     28     48      0      0

Table 7. Canonical coordinates

Books          Chisquare distance                    Hellinger distance
           dim 1      dim 2      dim 3           dim 1      dim 2      dim 3

a        -1.10857   -0.61445   -0.33902         0.64632    0.36299    0.12879
b         0.07397    0.70254    0.25265        -0.01661   -0.48923   -0.12388
c        -0.21153    0.46054   -0.49228         0.10998   -0.42185    0.32822
d        -0.77795   -0.11074    0.15556         0.46658   -0.01597   -0.10284
e         0.02781    0.40651    1.06135        -0.19193   -0.15180   -0.45570
f         0.35780    0.69602    0.09284        -0.37016   -0.29359   -0.14451
g        -0.16412   -0.15719    0.46353         0.23979    0.16911   -0.35829
h        -0.25023   -0.19626    0.39002         0.26103    0.14730   -0.23804
i         0.72788   -0.19452   -0.04749        -0.50899    0.14292    0.02936
j         0.68403    0.24337   -0.17956        -0.53320   -0.01242    0.04724
k         0.02729   -0.36648   -0.44297         0.03996    0.21098    0.36189
l         0.26802   -0.44749    0.28287        -0.00524    0.27070   -0.06481
m         0.02188    0.50893    0.51719         0.01506   -0.19266   -0.34080
n         0.12052    0.48459   -0.19476        -0.04555   -0.20966    0.04945
o         1.08308   -1.32602    0.03206        -0.39476    0.66357   -0.00090
p         0.64959   -0.07081   -0.13268        -0.49299    0.09097    0.08510
q        -0.98347   -0.39273   -0.25019         0.58910    0.21379    0.19442
r        -0.40006    0.32919   -0.33826         0.21605   -0.35929    0.28764
s        -0.74726    0.08101   -0.00508         0.43349   -0.30134    0.03139
t         0.56547    0.81454   -0.35256        -0.51167   -0.27874    0.10162

Table 8. Number of pages by topics

Books      A      B      C      D      E      F      G

a         31      0      0      0      0    164     11
b          0     16     54     18     27     13     14
c          0     40     32     10     42     60      0
d        190      0    350    190    280   1630    520
e         14      7     35     22     17      0     56
f         10     34     36     17     28      0     16
g        740      0    860    140      0    840    480
h        780      0    800     50    170   1050    600
i         74     19     33     12     26      0      0
j         40     34     33      8     15      0      0
k        108     48      4     10     46    108      0
l        109     13      5     17     39     32     46
m         16     35     69     24      0     26     41
n         13     43     30      3     24     24     14
o        290     10      6      0      8      0      2
p        184     48     82     42    134      0      0
q         29      0      0      0     41    211     32
r          0     19     56      0     39     75      0
s          0     22     45     42     60    230     59
t         30    128     90     28     48      0      0

Table 9. Canonical coordinates

Books          Chisquare distance                    Hellinger distance
           dim 1      dim 2      dim 3           dim 1      dim 2      dim 3

a        -0.62310    0.30413   -0.53463        -0.35082   -0.04632    0.42925
b         0.63345    0.41500    0.44316         0.25096   -0.37625   -0.40565
c         0.90486    0.78379   -0.17802         0.22985   -0.60374   -0.08540
d        -0.36427    0.36470   -0.12611        -0.23454   -0.16742    0.01739
e         0.20621    0.05647    0.54414         0.34853    0.01585   -0.32928
f         1.23299    0.45214    0.27035         0.59591   -0.20402   -0.28683
g        -0.16974   -0.27626    0.25537        -0.08445    0.24917   -0.09573
h        -0.18352   -0.18729    0.09783        -0.07908    0.11818    0.00148
i         0.85607   -0.49586   -0.21672         0.73058    0.07206    0.07026
j         1.35943   -0.06808    0.07459         0.77422   -0.01788   -0.04783
k         0.61122    0.01365   -0.60327         0.28646   -0.20830    0.40764
l         0.32350   -0.41447   -0.39537         0.23929    0.02213    0.24724
m         0.58680    0.20860    0.65792         0.17707    0.01371   -0.36016
n         1.20448    0.51816    0.07444         0.32243   -0.27238   -0.09222
o         0.47616   -1.60929   -0.73075         0.61206    0.41339    0.46017
p         0.87199   -0.26044   -0.37985         0.72806   -0.03491    0.09187
q        -0.44497    0.44631   -0.57047        -0.27936   -0.25639    0.41829
r         0.34209    0.56913   -0.04981         0.10278   -0.49844   -0.07991
s        -0.10036    0.60181   -0.15749        -0.14779   -0.45659   -0.10069
t         1.87687    0.57376    0.28038         0.78353   -0.24748   -0.19823
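The comparison behind Figures 5 and 6 can be sketched as follows. Several choices here are assumptions and not the author's procedure: the counts of Table 6 are used as transcribed above, the altered data are obtained by exact rescaling (a factor of 10 for books d, g, h and of 0.5 for f, j, n) where Table 8 rounds the halved counts, `ca_rows` is ordinary correspondence analysis, `hd_rows` uses (4.15) with the central point (4.17), and the agreement between the two sets of squared distances is summarized by a correlation coefficient.

```python
import numpy as np

# Table 6 as transcribed: rows are books a..t, columns are topics A..G.
table6 = np.array([
    [31, 0, 0, 0, 0, 164, 11], [0, 16, 54, 18, 27, 13, 14],
    [0, 40, 32, 10, 42, 60, 0], [19, 0, 35, 19, 28, 163, 52],
    [14, 7, 35, 22, 17, 0, 56], [20, 69, 72, 33, 55, 0, 32],
    [74, 0, 86, 14, 0, 84, 48], [78, 0, 80, 5, 17, 105, 60],
    [74, 19, 33, 12, 26, 0, 0], [80, 68, 67, 15, 29, 0, 0],
    [108, 48, 1, 10, 46, 108, 0], [109, 13, 5, 17, 39, 32, 46],
    [16, 35, 69, 24, 0, 26, 41], [26, 86, 60, 6, 48, 48, 28],
    [290, 10, 6, 0, 8, 0, 2], [184, 48, 82, 42, 134, 0, 0],
    [29, 0, 0, 0, 41, 211, 32], [0, 19, 56, 0, 39, 75, 0],
    [0, 22, 45, 42, 60, 230, 59], [30, 128, 90, 28, 48, 0, 0]], dtype=float)

# Assumed stand-in for Table 8: exact rescaling of six books.
scale = np.ones(20)
scale[[3, 6, 7]] = 10.0       # books d, g, h
scale[[5, 9, 13]] = 0.5       # books f, j, n
altered = table6 * scale[:, None]

def ca_rows(counts, k=3):
    """Ordinary correspondence analysis: principal coordinates of the rows."""
    P = counts / counts.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    return (U * sv)[:, :k] / np.sqrt(r)[:, None]

def hd_rows(counts, k=3):
    """Hellinger distance analysis, rows as populations, centre as in (4.17)."""
    r = counts.sum(axis=1) / counts.sum()
    X = np.sqrt(counts / counts.sum(axis=1, keepdims=True))
    Z = np.sqrt(r)[:, None] * (X - r @ X)
    U, sv, _ = np.linalg.svd(Z, full_matrices=False)
    return (U * sv)[:, :k] / np.sqrt(r)[:, None]

def squared_distances(Y):
    """Squared Euclidean distances between all pairs of rows of Y."""
    i, j = np.triu_indices(len(Y), k=1)
    return np.sum((Y[i] - Y[j]) ** 2, axis=1)

corr = {}
for name, method in (("CA", ca_rows), ("HD", hd_rows)):
    d_orig = squared_distances(method(table6))
    d_alt = squared_distances(method(altered))
    corr[name] = float(np.corrcoef(d_orig, d_alt)[0, 1])
    print(name, corr[name])
```

A correlation closer to one indicates a scatter closer to a straight line in the corresponding figure; plotting `d_alt` against `d_orig` for each method gives pictures of the kind described for Figures 5 and 6.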

Figure 5. A comparative plot of squared distances between all pairs of books in the reduced spaces
based on correspondence analysis. Axes: distance based on original data; distance based on altered values.

Figure 6. A comparative plot of squared distances between all pairs of books in the reduced spaces
based on Hellinger distance analysis. Axes: distance based on original data; distance based on altered values.


5. Concluding remarks

A general theory is developed for plotting high dimensional “population by variable” data, i.e. measurements
made on a set of characteristics of given populations, in a low dimensional Euclidean space. A
first  step  in  such  a  problem  is  the  specification  of  the  basic  metric  space  in  which  the  populations  can  be 
represented  as  points  using  the  entire  data,  and  a  characterization  of  the  configuration  of  the  points  in 
terms  of  distances  between  points.  The  second  is  the  development  of  methodology  for  transforming  the 
points  from  the  basic  space  to  a  low  dimensional  Euclidean  space  with  the  usual  definition  of  distance 
preserving  the  configuration  of  points  to  the  extent  possible.  The  choices  of  the  basic  space  and  the 
distance  function  between  points  have  to  be  made  on  practical  considerations  depending  on  the  problem 
under  investigation.  A  closed  form  solution  is  obtained  when  the  basic  space  is  a  vector  space  endowed 
with  an  inner  product  and  the  associated  norm.  Some  examples  are  given  involving  measurements  on 
discrete  variables. 

When  we  have  data  in  the  form  of  frequencies  of  individuals  of  a  population  under  different  categories 
of  an  attribute,  a  well  known  method  for  dimensionality  reduction  for  representing,  say  the  populations, 
is correspondence analysis. The basic space in this case is a vector space where each population is
represented by the vector of relative frequencies of the different categories of an attribute and distance
between vectors is defined by a chisquare type formula. Such a distance function is not an intrinsic measure
of difference between two populations as it depends not only on the differences between their relative
frequencies,  but  also  on  the  average  relative  frequencies  computed  from  the  set  of  populations  under  study. 
Thus  the  configuration  of  any  subset  of  populations  depends  on  what  other  populations  are  included  in  the 
analysis,  and  also  on  the  relative  numbers  of  individuals  observed  from  each  population.  An  alternative 
approach  of  representing  a  population  by  the  vector  of  the  square  roots  of  relative  frequencies  and  defining 
distance  between  two  populations  by  the  Hellinger  formula  does  not  have  the  drawbacks  associated  with 
the  chisquare  type  formula.  In  addition,  the  new  analysis  has  the  same  advantage  of  providing  tests 
of  significance  for  homogeneity  of  the  populations  as  in  correspondence  analysis  based  on  the  chisquare 
formula. 

It may be contended that CA is meant to be used for the analysis of contingency tables with dichotomized
data using two attributes like hair color and eye color (as originally demonstrated by R.A.
Fisher),  and  not  for  the  analysis  of  population  by  variable  data  where  anomalies  of  the  type  described  in 
the  paper  may  occur.  However,  one  finds  in  published  literature  more  examples  of  the  latter  type  of  data 
analyzed  through  CA.  Further,  even  with  attribute  data,  if  the  configurations  of  the  column  (or  row) 
profiles  for  two  different  populations  (with  possibly  different  marginal  distributions)  are  to  be  compared, 
HD  analysis  is  more  appropriate  than  the  CA.  It  is  the  author’s  opinion  that  the  choice  of  a  distance 
measure  between  populations  (row  or  column  profiles)  must  depend  on  the  nature  of  the  data  and  the 
purpose of analysis. Prescription to use a particular distance as in the CA in all problems may be misleading.
Distance measures other than the chisquare and Hellinger types may be more appropriate in some
situations.  For  a  purely  exploratory  data  analysis,  it  is  possible  that  a  wide  variety  of  distance  measures 
reveal  similar  configurations  of  the  populations  in  terms  of  clustering  and  inter  cluster  relationships. 

Between the choices of chisquare and Hellinger distances, the latter seems to offer some advantages,
as it has similar theoretical properties to the former and in addition it is defined as an intrinsic
function of two population profiles independent of what other populations are included in a study.

A recent technical report by Rios, Villarroya and Oller (1994) discusses the same problem as in
the  present  paper,  viz.,  simultaneous  representation  of  populations  and  random  variables,  under  the 
assumption  of  an  underlying  parametric  model. 

The method, referred to there as Intrinsic Data Analysis, is based on the Riemannian structure given
by  the  Fisher  information  metric  and  its  corresponding  distance,  the  Rao  distance.  The  statistical  popu¬ 
lations  are  viewed  as  points  on  a  Riemannian  manifold  and  the  random  variables  with  finite  expectation, 
as  vector  fields,  namely,  the  gradient  of  the  random  variable  mean  value,  or,  by  integration,  a  bundle  of 
curves  on  the  manifold. 

Then, assuming certain additional regularity conditions, a reference point on the manifold is selected
as the Riemannian center of mass of the statistical populations, and the points representing the populations
and the curves representing the variables are mapped, through the inverse of the Riemannian exponential
map, into the tangent space at the center of mass, which has a Euclidean vector space structure. Then,
classical dimension reduction techniques such as principal component analysis can be used to obtain a
low dimensional Euclidean space which allows an optimal population representation. Finally, the curves
in the tangent space are projected into the low dimensional space obtained.

This method is applied to multivariate normal and multinomial distributions. In the multinomial
case, the Rao distance, $\rho$, between two populations $p_1, \ldots, p_n$ and $q_1, \ldots, q_n$ is proportional to the
Bhattacharyya distance

$$\rho = 2 \arccos \sum_{j=1}^{n} \sqrt{p_j q_j}$$

which is a monotone transformation of the Hellinger distance, and thus this method will share some
properties with the latter.
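The monotone relation mentioned above can be made explicit with a small sketch. Assuming the Hellinger distance H is the Euclidean distance between square-root profiles (no scaling factor), one has H^2 = 2 - 2 * sum_j sqrt(p_j q_j), so that rho = 2 arccos(1 - H^2 / 2) = 4 arcsin(H / 2). The function names and the two profiles below are hypothetical.

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Sum over categories of sqrt(p_j * q_j)."""
    return float(np.sum(np.sqrt(np.asarray(p, dtype=float) * np.asarray(q, dtype=float))))

def rao_distance(p, q):
    """Rao distance between two multinomial profiles: 2 arccos(sum sqrt(p_j q_j))."""
    bc = np.clip(bhattacharyya_coefficient(p, q), -1.0, 1.0)   # guard against rounding
    return float(2.0 * np.arccos(bc))

def hellinger_distance(p, q):
    """Euclidean distance between square-root profiles (assumed unscaled)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Two hypothetical profiles over four categories.
p = [0.10, 0.20, 0.30, 0.40]
q = [0.40, 0.30, 0.20, 0.10]
rho = rao_distance(p, q)
H = hellinger_distance(p, q)
print(rho, 4.0 * np.arcsin(H / 2.0))   # the two values agree
```

Because 4 arcsin(H / 2) is increasing in H on its range, ordering pairs of populations by Rao distance or by Hellinger distance gives the same ranking.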


References 

1. J.P. Benzécri, Correspondence Analysis Handbook, Marcel Dekker, Inc., New York (1992).

2. L.L. Cavalli-Sforza, Genes, peoples and languages, Scientific American 265 (1991), 104-110.

3. H. Chernoff, The use of faces to represent points in k-dimensional space graphically, J. Amer. Statist. Assoc. 68 (1973),
361-368.

4. C. Eckart and G. Young, The approximation of one matrix by another of lower rank, Psychometrika 1 (1936), 211-218.

5. G. Eslava-Gómez and F.H.C. Marriott, Criteria to represent groups in the plane when the grouping is unknown,
Biometrics 49 (1993), 1088-1098.

6. K.R. Gabriel, The biplot graphical display of matrices with applications to principal component analysis, Biometrika
58 (1971), 453-467.

7. A. Gifi, Nonlinear Multivariate Analysis, New York: John Wiley (1981, 1990).

8. J.C. Gower, Recent advances in biplot methodology, In Multivariate Analysis: Future Directions 2 (Eds. C.M. Cuadras
and C.R. Rao), North Holland (1993), 295-325.

9. M.J. Greenacre, Theory and Applications of Correspondence Analysis, London: Academic Press (1984).

10. ______, Biplots in correspondence analysis, J. Applied Statistics 20 (1993), 251-269.

11. J.B. Kruskal and M. Wish, Multidimensional Scaling, Sage Publications (1978).

12. P.C. Mahalanobis, On the generalized distance in statistics, Proc. Nat. Inst. Sci., India 12 (1936), 49-55.

13. P.C. Mahalanobis, D.N. Mazumdar and C.R. Rao, Anthropometric survey of United Provinces, 1941: A statistical
study, Sankhya 9 (1949), 90-324.

14. S. Nishisato, Analysis of Categorical Data: Dual Scaling and its Applications, University of Toronto Press, Toronto,
Canada (1980).

15. C.R. Rao, Information and accuracy attainable in the estimation of statistical parameters, Bull. Cal. Math. Soc. 37
(1945), 81-91.

16. ______, The problem of classification and distance between two populations, Nature 159 (1947), 30.

17. ______, The utilization of multiple measurements in problems of biological classification (with discussion), J. Roy. Statist.
Soc. Series B 10 (1948), 159-193.

18. ______, Some statistical methods for comparison of growth curves, Biometrics 14 (1958), 1-17.

19. ______, The use and interpretation of principal component analysis in applied research, Sankhya 26 (1964), 329-357.

20. ______, Linear Statistical Inference and its Applications, 2nd Edition, New York: Wiley (1973).

21. ______, Separation theorems for singular values of matrices and their applications in multivariate analysis, J. Multivariate
Analysis 9 (1979), 362-377.

22. ______, Matrix approximations and reduction of dimensionality in multivariate statistical analysis, In Multivariate
Analysis V (Ed. P.R. Krishnaiah), Amsterdam: North Holland (1980), 3-22.

23. ______, Tests for dimensionality and interaction of mean vectors under general and reducible covariance structures, J.
Multivariate Analysis 16 (1985), 173-184.

24. ______, Prediction of future observations in growth curve type models, J. Statistical Science 2 (1987), 434-471.

25. M. Rios, A. Villarroya and J.M. Oller, Intrinsic Data Analysis: A method for the simultaneous representation of
populations and variables, Mathematics preprint series 160 (1994), Universitat de Barcelona.

26. J. Von Neumann, Some matrix inequalities and metrization of metric spaces, Tomsk. Univ. Rev. 1 (1937), 286-299.

27. E.J. Wegman, Hyperdimensional data analysis using parallel coordinates, J. Amer. Statist. Assoc. 85 (1990), 664-675.

E-mail address: crr1@psuvm.psu.edu

Center for Multivariate Analysis, Department of Statistics, Pennsylvania State University, University
Park, PA 16802, USA


